ABSTRACT
A parallel algorithm for the problem of selecting the k-th
smallest out of n elements is presented. It runs in O(n/p) time using
p <= n/(log n loglog n) processors on an exclusive-read exclusive-write
parallel RAM.
while m > p begin

(Comment. The input for each iteration of the while loop is the array
Y whose length is m. The goal is to find its ℓ-th smallest element.)

Step 1. (Assume w.l.g. that m/p is an integer.)
    for processor i, 1 <= i <= p, pardo
        Find the median of subarray Y[(i-1)m/p + 1, ..., im/p] in O(m/p) time
        using the linear time serial algorithm of [BFPRT-72].

Step 2. Sort the p medians found in Step 1 in O(log p) time using p/2
of the processors by the sorting algorithm of [AKS-83]. Denote the
median of these medians by a. 'Broadcast' a to all processors.
(Comment. This algorithm can be implemented on an EREW PRAM since in
each of the O(log p) time units of this algorithm at most p/2 disjoint pairs
are compared.)

Step 3. Compute into s1, s2 and s3 the number of elements smaller
than, equal to and larger than a, respectively. 'Broadcast' s1, s2 and
s3 to all processors.

Step 4. if s1 < ℓ <= s1 + s2
            then STOP - Output a.
        if ℓ <= s1
            then Y := 'all elements smaller than a'; m := s1
            else Y := 'all elements larger than a'; m := s3; ℓ := ℓ - (s1 + s2)
end

Step 5. Sort the <= p elements of array Y in O(log p) time by the
sorting algorithm of [AKS-83]. Output the ℓ-th element.
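The control flow of Steps 1 to 5 can be sketched serially. The following is a minimal sketch, not the EREW PRAM implementation: the function name `parallel_select` is hypothetical, the p processors are simulated by a plain loop, and Python's built-in sort stands in for both the linear-time selection of [BFPRT-72] and the sorting algorithm of [AKS-83].

```python
import math


def parallel_select(x, k, p):
    """Return the k-th smallest (1-based) element of x.

    Serial simulation of the while loop: the p "processors" are a loop,
    and sorted() is a stand-in for the cited selection/sorting routines.
    """
    y = list(x)
    ell = k
    m = len(y)
    while m > p:
        # Step 1: each simulated processor finds the median of its block.
        block = math.ceil(m / p)
        medians = []
        for start in range(0, m, block):
            chunk = sorted(y[start:start + block])
            medians.append(chunk[(len(chunk) - 1) // 2])
        # Step 2: a is the median of the block medians.
        medians.sort()
        a = medians[(len(medians) - 1) // 2]
        # Step 3: count elements smaller than, equal to and larger than a.
        s1 = sum(1 for v in y if v < a)
        s2 = sum(1 for v in y if v == a)
        s3 = m - s1 - s2
        # Step 4: stop, or keep only the side that contains the answer.
        if s1 < ell <= s1 + s2:
            return a
        if ell <= s1:
            y = [v for v in y if v < a]
            m = s1
        else:
            y = [v for v in y if v > a]
            m = s3
            ell -= s1 + s2
    # Step 5: at most p elements remain; sort them and pick the ell-th.
    return sorted(y)[ell - 1]
```

Since a is always an element of Y, s2 >= 1 and the array shrinks in every iteration, so the loop terminates for any p >= 1.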

Implementation remarks.

1. Each 'broadcast' can be implemented in time O(log p) (= O(log n)).
2. Array X can be readily copied into array Y in
O(n/p) (= O(log n loglog n)) time in the initialization phase (recall
p = n/(log n loglog n)).

3. s1, s2 and s3 are each computed in O(m/p + log m) time in Step 3. We show
how to compute s1. In the summation algorithm of the Appendix enter 1
for each element smaller than a and 0 otherwise.


4. The assignment into array Y in Step 4 is computed in O(m/p + log m)
time in either one of the two cases where such an assignment is
required. We show how to perform this assignment in the case ℓ <= s1.
In the partial sum algorithm of the Appendix enter 1 for each element
smaller than a and 0 otherwise. The partial sum at each element
smaller than a is its serial number among elements smaller than a in
the 'old' array Y and its entry number in the 'new' array Y. The
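The use of partial sums as destination addresses can be illustrated as follows. This is a minimal serial sketch: the name `compact_smaller` is hypothetical, and `itertools.accumulate` stands in for the parallel partial sum algorithm of the Appendix.

```python
from itertools import accumulate


def compact_smaller(y, a):
    """Copy the elements of y that are smaller than a into a new array,
    using partial sums of 0/1 flags as destination indices."""
    flags = [1 if v < a else 0 for v in y]
    # Inclusive partial sums; a serial stand-in for the tree-based routine.
    ranks = list(accumulate(flags))
    s1 = ranks[-1] if ranks else 0
    new_y = [None] * s1
    # Each position j writes to a distinct cell, so in a parallel setting
    # these writes would not conflict.
    for j, v in enumerate(y):
        if flags[j]:
            new_y[ranks[j] - 1] = v  # rank is the 1-based entry number
    return new_y
```

For example, `compact_smaller([7, 2, 9, 4, 1], 5)` gives `[2, 4, 1]`: the flags are 0, 1, 0, 1, 1 and their partial sums 0, 1, 1, 2, 3 are the entry numbers of 2, 4 and 1 in the new array.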


Time complexity. Initialization takes O(log n loglog n) time.

Let us look at an iteration of the while loop that starts with an array
of length m for some m <= n. Step 1 takes O(m/p) time. Step 2 takes
O(log p) time. Steps 3 and 4 each take O(m/p + log m) time. Thus, the
total time spent on this iteration is O(m/p + log n). Next, we show
that s1 and s3 (computed in Step 3) are each <= 3m/4.

Claim. s2 + s3 >= m/4.

Proof. Count all medians (of Step 1) which are larger than or equal to a
and all the elements in their subarrays which are larger than or equal to
these medians. The claim implies s1 <= 3m/4.
Proving that s3 <= 3m/4 is similar.
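The counting argument in the proof can be written out explicitly. The sketch below ignores rounding (it assumes p is even and treats m/(2p) as an integer): at least p/2 of the p medians are larger than or equal to a, and each such median has at least m/(2p) elements of its own subarray that are larger than or equal to it.

```latex
\[
s_2 + s_3 \;\ge\; \frac{p}{2}\cdot\frac{m}{2p} \;=\; \frac{m}{4},
\qquad\text{hence}\qquad
s_1 \;=\; m - (s_2 + s_3) \;\le\; \frac{3m}{4}.
\]
```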

Let m_i be the length of array Y at the beginning of the i-th iteration
of the while loop. We just proved m_i <= (3/4)^(i-1) n. The while loop is
iterated until m_i <= p (= n/(log n loglog n)). Therefore, the number of
iterations of this loop is O(loglog n). Hence, the total time spent on
the while loop is at most

\[ \sum_{i=1}^{O(\log\log n)} O\bigl((3/4)^{i-1}\, n/p + \log n\bigr), \]

which is O(n/p + log n loglog n) or O(log n loglog n).
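The bound on the sum follows from splitting it into a geometric part and a part that contributes log n per iteration. A short worked version, using only the quantities defined above:

```latex
\[
\sum_{i=1}^{O(\log\log n)} O\!\left(\left(\tfrac{3}{4}\right)^{i-1}\frac{n}{p} + \log n\right)
\;\le\; O\!\left(\frac{n}{p}\sum_{i\ge 1}\left(\tfrac{3}{4}\right)^{i-1}\right)
      + O(\log n \,\log\log n)
\;=\; O\!\left(\frac{4n}{p}\right) + O(\log n \,\log\log n).
\]
```

With p = n/(log n loglog n) the first term is itself O(log n loglog n), which gives the second form of the bound.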

Step 5 takes additional O(log p) time. Summing up gives a total
running time of O(log n loglog n) using n/(log n loglog n) processors.
We can simulate this parallel algorithm by fewer processors in the
spirit of Brent's general theorem (see Appendix) and get O(n/p) time
using p <= n/(log n loglog n) processors. This can be alternatively
written as O(n/p + log n loglog n) time using p processors.

Remark. The time estimate for the [AKS-83] sorting algorithm
involves large constants. It is possible to use instead the sorting
algorithm of [BH-82] (resp. an EREW PRAM implementation of the sorting
network of [Ba-68]). It sorts n elements in O(log n loglog n) (resp.
O(log^2 n)) time using n (resp. n/2) processors. This would result in
O(n/p) time using p <= n/(log n (loglog n)^2) (resp. n/(log^2 n loglog n))
processors. Observe that the algorithm of [BH-82] is designed for a
concurrent-read exclusive-write (CREW) PRAM model of computation, where
simultaneous access by more than one processor is allowed for read but
not write purposes.

Appendix. 

Theorem (Brent). Any synchronous parallel algorithm of time t
that consists of a total of x elementary operations can be implemented
by p processors within a time of ⌈x/p⌉ + t.

Proof of Brent's theorem. Let x_i denote the number of operations
performed by the algorithm in time i (so that x_1 + ... + x_t = x). We now
use the p processors to "simulate" the algorithm. Since all the operations in
time i can be executed simultaneously, they can be computed by the p
processors in ⌈x_i/p⌉ units of time. Thus, the whole algorithm can be
implemented by p processors in time

\[ \sum_{i=1}^{t} \lceil x_i/p \rceil \;\le\; \sum_{i=1}^{t} (x_i/p + 1) \;\le\; \lceil x/p \rceil + t. \]
□
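The inequality in the proof can be checked numerically on a concrete schedule. This is a small sketch: the function names are hypothetical, and `ops_per_step` is an assumed list holding x_i for each time unit i.

```python
import math


def brent_simulated_time(ops_per_step, p):
    """Time needed when p processors simulate the schedule step by step:
    step i with x_i operations takes ceil(x_i / p) time units."""
    return sum(math.ceil(x_i / p) for x_i in ops_per_step)


def brent_bound(ops_per_step, p):
    """The bound ceil(x / p) + t from the theorem."""
    x = sum(ops_per_step)
    t = len(ops_per_step)
    return math.ceil(x / p) + t
```

For the schedule `[8, 4, 2, 1]` (the additions of a balanced tree over 16 leaves) and p = 3, the simulated time is 3 + 2 + 1 + 1 = 7, while the bound is ⌈15/3⌉ + 4 = 9.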

Remark. The proof of Brent's theorem poses two implementation
problems. The first is to evaluate x_i at the beginning of time i in
the algorithm. The second is to assign the processors to their jobs.

We use balanced binary trees for a simple computation of sums and
partial sums. Similar routines were used in [W-79], [CLC-81] and
[Vi-81], to mention just a few.

Input. An array of n numbers A(1), A(2), ..., A(n). Assume, w.l.g., that
log_2 n is an integer.
Problem. Compute their sum.

Algorithm. "Plant" a balanced binary tree with n leaves. Every node
of the tree is denoted [h,j]. See Fig. 1. Leaf [0,j] corresponds to
A(j). Associate a number B[h,j] with every node of the tree.
Initialization. for all 1 <= j <= n pardo B[0,j] := A(j).
for h := 1 to log n
    for all 1 <= j <= 2^(log n - h) pardo B[h,j] := B[h-1,2j-1] + B[h-1,2j].

It is easy to verify that B[log n,1] holds the desired sum.
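The summation algorithm translates almost line by line into code. The following is a serial sketch in which every pardo becomes an ordinary loop; the function name `tree_sum` is hypothetical, and index 0 of each row is left unused so that j stays 1-based as in the text.

```python
def tree_sum(a):
    """Sum the list a (length a power of two) over a balanced binary tree.

    Returns the sum together with the table B, where B[h][j] is the number
    associated with node [h, j].
    """
    n = len(a)
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    log_n = n.bit_length() - 1
    # Level h has n / 2**h nodes; slot 0 of each row is unused padding.
    B = [[0] * (n // 2**h + 1) for h in range(log_n + 1)]
    for j in range(1, n + 1):              # Initialization (a pardo)
        B[0][j] = a[j - 1]
    for h in range(1, log_n + 1):
        for j in range(1, n // 2**h + 1):  # a pardo in the text
            B[h][j] = B[h - 1][2 * j - 1] + B[h - 1][2 * j]
    return B[log_n][1], B
```

For instance, `tree_sum([3, 1, 4, 1, 5, 9, 2, 6])[0]` is 31.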

Think first about an n processor implementation of this summation
algorithm. It runs in O(log n) time. Then apply the proof of Brent's
theorem to get an alternative implementation that uses only n/log n
processors and runs in O(log n) time. This summation algorithm can be
extended to solve the following partial-sum problem.
Input. As for the summation problem.
Problem. Compute A(1) + A(2) + ... + A(i) for all 1 <= i <= n.

Algorithm. Perform the summation algorithm given above. An additional
"down-sweep" of the tree (from the root to the leaves), which roughly
amounts to reversing the operation of the summation algorithm, will
complete the job:

Associate another number C[h,j] with each node [h,j].
Initialization. C[log n,1] := 0.
for h := log n - 1 downto 0
    for all 1 <= j <= 2^(log n - h) pardo
        if j is odd
            then C[h,j] := C[h+1,(j+1)/2]
            else C[h,j] := C[h+1,j/2] + B[h,j-1].
for all 1 <= j <= n pardo C[0,j] := C[0,j] + B[0,j].

C[0,j], 1 <= j <= n, hold the desired partial-sums. This algorithm
can also be implemented to run in O(n/p + log n) time using p
processors on an EREW PRAM by applying Brent's theorem.
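Both sweeps of the partial-sum algorithm can be sketched in the same way. Again this is a serial sketch with each pardo written as a loop; `tree_prefix_sums` is a hypothetical name. Before the final step, C[h][j] holds the sum of all leaves to the left of the subtree rooted at [h, j].

```python
def tree_prefix_sums(a):
    """Return the inclusive partial sums of a (length a power of two)."""
    n = len(a)
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    log_n = n.bit_length() - 1
    # Rows are 1-based in j; slot 0 of each row is unused padding.
    B = [[0] * (n // 2**h + 1) for h in range(log_n + 1)]
    C = [[0] * (n // 2**h + 1) for h in range(log_n + 1)]
    # Up-sweep: the summation algorithm.
    for j in range(1, n + 1):
        B[0][j] = a[j - 1]
    for h in range(1, log_n + 1):
        for j in range(1, n // 2**h + 1):
            B[h][j] = B[h - 1][2 * j - 1] + B[h - 1][2 * j]
    # Down-sweep: from the root to the leaves.
    C[log_n][1] = 0
    for h in range(log_n - 1, -1, -1):
        for j in range(1, n // 2**h + 1):
            if j % 2 == 1:
                # Left child: same leaves to its left as its parent.
                C[h][j] = C[h + 1][(j + 1) // 2]
            else:
                # Right child: parent's value plus the left sibling's sum.
                C[h][j] = C[h + 1][j // 2] + B[h][j - 1]
    # Final step: add each leaf's own value to make the sums inclusive.
    return [C[0][j] + B[0][j] for j in range(1, n + 1)]
```

For example, `tree_prefix_sums([3, 1, 4, 1, 5, 9, 2, 6])` gives `[3, 4, 8, 9, 14, 23, 25, 31]`.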